Introduction

Rapid progress in both AI and genomics is mutually reinforcing advances in medicine. In this study, we apply various machine learning and AI methods to gene expression data obtained from cancer cells. We aim to understand how these cells behave under different micro-environmental conditions, specifically focusing on predicting oxygen levels: hypoxia (low) and normoxia (normal). To achieve this, we construct a model using single-cell RNA sequencing data.

Materials and Methods

Materials

The data we analyzed come from 4 experiments in which two different cancer cell lines, MCF7 and HCC1806, were studied. Each was sequenced with two different RNA sequencing technologies: SMARTSeq and DropSeq.

Methods

Python libraries

Exploratory data analysis

EDA1: MCF7 Cell line

MCF7 -- Meta Data

The indices of the data frame mcf7_smarts_metadata are the filenames of the aligned sequencing output (BAM files) for each cell studied. The same dataframe contains 8 columns:

We can see that each filename is created by combining the information contained in some of the columns. For example, the first row has the filename output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam, which corresponds to cell S28 at plate position A10, aligned and sorted by coordinate, under the Hypoxia oxygen condition.

MCF7 -- Unfiltered Data

Each column of the dataframe mcf7_smarts_unfiltered (383 columns) corresponds to a row of the dataframe mcf7_smarts_meta (383 rows). Thus, for each file (i.e., each aligned cell) we know the expression level of each gene.

The indices of this unfiltered dataframe are gene names (WASH7P, MT-TT, etc.), identifiers known as gene symbols. Gene symbols are short acronyms and may not be unique. Later on we will analyze the correlation between the rows (gene expression profiles) to check whether we have duplicated expression profiles under different symbols.

The unfiltered dataframe contains only numeric information:

Do we have any missing data? We can look at the null values row by row, but since we have 383 rows, we look at the total sum of the missing values in each row:
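This check can be sketched as follows; the frame below is a toy stand-in for mcf7_smarts_unfiltered (its name, shape, and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for mcf7_smarts_unfiltered (assumed shape: genes x cells)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.poisson(1, size=(100, 10)),
                  index=[f"GENE{i}" for i in range(100)],
                  columns=[f"cell_{j}" for j in range(10)])

# Count of missing values per cell, then the overall total
missing_per_cell = df.isnull().sum()
total_missing = int(missing_per_cell.sum())  # 0 here, so no imputation needed
```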

There are no missing values in our dataframe, therefore there is no need for imputation.

We can look at the descriptive statistics of our dataframe. We analyze the distributions of single cells as in the example report:

Just by glancing at the distributions of single cells, we see that many of them are highly right-skewed, as they are filled with 0 values. They are not normalized: they do not have unit variance and zero mean, and their standard deviation is very large compared to the mean.

Let's choose 10 random variables to visualize their non-normal distributions:

The plots confirm what we noted about the distribution characteristics: we observe an elongated right tail.

We continue the exploratory data analysis by investigating outliers. We use the IQR rule to detect them; anything outside this range will be dropped:
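A minimal sketch of the IQR rule on a toy count matrix (the data and the per-column fences are assumptions; the report's exact implementation may differ):

```python
import numpy as np
import pandas as pd

# Toy count matrix standing in for the expression data (assumption)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.poisson(1, size=(200, 5)),
                  columns=[f"cell_{j}" for j in range(5)])

# Per-column IQR fences: [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1

# Keep only rows whose every value lies inside the fences
inside = df.ge(q1 - 1.5 * iqr).all(axis=1) & df.le(q3 + 1.5 * iqr).all(axis=1)
removed_fraction = 1 - inside.mean()  # share of rows the rule would drop
```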

Using the interquartile range method to remove outliers would discard 72% of our dataset, which is not a desirable outcome. As observed above, many observations are filled with 0s.

We can quantify sparsity this way: if X% (where X > threshold) of the gene expressions of an observation are 0, we consider that observation highly selective of some specific genes and hence sparse.

If an entire dataset has mostly sparse observation then we can say it is a sparse structure.

We defined two thresholds for sparsity: 95% and 50%.
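The per-cell zero-fraction computation can be sketched like this (the toy matrix and its zero rate are assumptions; the real data are much larger):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Sparse toy matrix: genes (rows) x cells (columns), mostly zeros by construction
df = pd.DataFrame(rng.poisson(0.4, size=(500, 50)))

# Fraction of zero gene expressions per cell (column)
zero_frac = (df == 0).mean(axis=0)

# Share of cells exceeding each sparsity threshold
sparse_95 = float((zero_frac > 0.95).mean())
sparse_50 = float((zero_frac > 0.50).mean())
```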

We see that 91% of the single cells have more than half of their gene expressions equal to 0, which means we do not have densely expressive cells.

We can do the same analysis for the sparsity of the features themselves:

We see that almost 30% of the genes are not expressed in at least half of the single cells, and 18% of them are expressed in only 5% of the single cells.

As we noted earlier, looking at the descriptive statistics and some density plots, the variables are highly centered around zero. Let's quantify the skewness and kurtosis:
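Pandas can compute both statistics per column; a small sketch on zero-inflated toy counts (the data are an assumption standing in for the single cells):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Zero-inflated toy counts standing in for 5 single cells (assumption)
df = pd.DataFrame(rng.poisson(0.3, size=(1000, 5)),
                  columns=[f"cell_{j}" for j in range(5)])

skews = df.skew()   # sample skewness per cell (0 for a normal distribution)
kurts = df.kurt()   # excess kurtosis per cell (0 for a normal distribution)
```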

As suggested by the histograms, the data is far from normal distribution with many samples having high skewness and kurtosis values.

This is problematic because non-normal data may violate the assumptions of some machine learning models, or simply make it hard for an algorithm to detect differences among the non-zero values.

One way to make the distribution less skewed is to apply a log transformation:
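A minimal sketch with toy counts (an assumption); note the use of log1p = log(1 + x), which is defined at the many zero counts, unlike log(x):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Zero-inflated, right-skewed toy counts (assumption)
counts = pd.Series(rng.poisson(0.5, 1000) * rng.poisson(8, 1000)).astype(float)

# log1p handles zeros gracefully and compresses the long right tail
logged = np.log1p(counts)

skew_before, skew_after = counts.skew(), logged.skew()
```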

The same sample before log transformation is heavily centered around 0:

Now let's take only the first 50 columns/cells (for speed) and calculate skewness and kurtosis after applying the log transformation:

Now most of the variables have a skewness score around 0, as expected after the log transformation.

If we compare the density plots with the pre-transformation ones above, we see they have changed: the distributions become more bimodal.

Let's normalize the data between cells with the Normalizer transformer of sklearn:
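A small sketch of sklearn's Normalizer on made-up values: it rescales each sample (row) to unit L2 norm, independently of the other rows.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two made-up samples (rows); values chosen so the norms are easy to check
X = np.array([[4.0, 3.0, 0.0],
              [0.0, 5.0, 12.0]])

# Normalizer rescales each ROW to unit L2 norm, independently
X_norm = Normalizer(norm="l2").fit_transform(X)

row_norms = np.linalg.norm(X_norm, axis=1)  # each row now has norm 1
```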

I am loading the filtered and filtered+normalized datasets to make a comparison as requested:

The samples in the filtered dataset have the same distribution shape:

The normalized dataset has a smaller standard deviation; the range of values it takes is narrower.

To understand which genes convey the same information, we can check their correlations.

We create the dataset without duplicates
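Exact-duplicate removal of this kind could look like the following sketch (the gene symbols and counts below are made up for illustration; the alias row is a hypothetical duplicate):

```python
import pandas as pd

# Toy gene-by-cell frame with one exact duplicate row under a second symbol
df = pd.DataFrame({"cell_1": [5, 0, 5, 2],
                   "cell_2": [1, 3, 1, 0],
                   "cell_3": [0, 2, 0, 7]},
                  index=["GENE_A", "GENE_B", "GENE_A_ALIAS", "GENE_C"])

# Keep the first occurrence of each duplicated expression profile
deduped = df[~df.duplicated(keep="first")]
removed_fraction = 1 - len(deduped) / len(df)
```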

We removed less than 1% of the dataset

Data structure after EDA

We are investigating the correlations between the samples (i.e. the correlation between gene expression profiles of different cells):

We see that the correlation matrix of cells contains high values and is therefore mostly red. There are some white stripes that indicate cells that are not correlated with the other cells.

For each cell we count how many low-correlated cells there are. We define low correlation as a coefficient in the range ±0.2:
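The counting step can be sketched like this (the 30x200 toy matrix is an assumption; with random counts, most pairs land in the low-correlation band):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
# Toy cell-by-gene matrix (assumption): 30 cells, 200 genes
X = pd.DataFrame(rng.poisson(1, size=(30, 200)))

corr = X.T.corr()                       # 30 x 30 cell-to-cell correlations
np.fill_diagonal(corr.values, np.nan)   # ignore self-correlation

# Per cell: how many other cells fall inside the (-0.2, 0.2) band
low_counts = corr.abs().lt(0.2).sum(axis=1)

# "Uncorrelated cell group": low correlation with at least half of the others
uncorrelated = low_counts[low_counts >= (len(corr) - 1) / 2]
```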

Let's define the 'uncorrelated cell group' as the group of cells that have low correlation with at least half of the other cells:

14 cells express very different gene profiles (i.e., their correlations with almost all other cells are between -0.2 and 0.2).

The low-correlated group of cells have 0 in at least half of their data points; some even have 3/4 of their values equal to 0.

We can also look at the cells that are highly correlated with other cells in the same way. We define high correlation as absolute values greater than 0.75 (i.e., above 0.75 or below -0.75):

Half of the cells are highly correlated with at least half of the other cells

These cells above are correlated with more than half of the cells.

Let's look at the correlation between Hypoxia cells:

Let's look at the correlation between Normal cells:

The average correlation within the two cell groups (low oxygen condition and high oxygen condition) is similar. That means high oxygen cells are not more similar to each other than low oxygen cells are to each other.

We choose 5 random cells from high oxygen condition and then 5 random cells from low oxygen condition and look at their distributions:

In the no-hypoxia condition, we chose 4 random cells that are highly correlated with other cells, and one cell (in purple) that has a lower correlation with the other no-hypoxia cells.

In the hypoxia condition, we chose 4 random cells that are highly correlated with other cells, and one cell (in orange) that has a lower correlation with the other hypoxia cells.

We also check the correlations between the features (i.e. the expressions of different genes) as requested.

It takes too long to check all the features, so we use only 5% of them for this exercise:

Just looking at the first 20 features, we notice on the correlation matrix red areas that indicate high positive correlations. We also notice some negative correlations (but not that high).

Features with high correlations can be problematic for some machine learning algorithms. This problem is known as multicollinearity. To mitigate it, only one of each highly correlated pair of features should be used in the model.

---------------------- Memory Cleaning Start --------------------------------

---------------------- Memory Cleaning End --------------------------------

EDA2: HCC1806 SmartSeq experiment

HCC1806 -- Meta Data

We now proceed with the second dataset. The structure of the data is exactly the same: as with mcf7_smarts_metadata, the indices of the data frame HCC1806_smarts_metadata are the filenames of the aligned sequencing output (BAM files) for each cell studied. Again, like mcf7_smarts_metadata, this dataframe contains 8 columns:

We can see that each filename is created by combining the information contained in some of the columns. For example, the first row has the filename output.STAR.PCRPlate1A10_Normoxia_S123_Aligned.sortedByCoord.out.bam, which corresponds to cell S123 at plate position A10 on PCR plate 1, aligned and sorted by coordinate, under the Normoxia oxygen condition.

HCC1806 -- Unfiltered Data

Each column of the dataframe HCC1806_smarts_unfiltered (243 columns) corresponds to a row of the dataframe HCC1806_smarts_meta (243 rows). Thus, for each file (i.e., each aligned cell) we know the expression level of each gene.

The indices of this unfiltered dataframe are gene names (WASH7P, MT-TT, etc.), identifiers known as gene symbols. Gene symbols are short acronyms and may not be unique. Below we verify that the dataset contains only unique gene symbols:

The unfiltered dataframe contains only numeric information:

Do we have any missing data? We can look at the null values row by row, but since we have 243 rows, we look at the total sum of the missing values in each row:

There are no missing values in our dataframe, therefore there is no need for imputation.

We can look at the descriptive statistics of our dataframe:

Just by glancing at the distributions of the features, we see that many variables are highly right-skewed, as they are filled with 0 values. The features are not normalized: they do not have unit variance and zero mean, and the standard deviation is very large compared to the mean. Their distributions have an elongated right tail.

Let's choose 10 random variables to visualize their non-normal distributions:

We continue the exploratory data analysis by investigating outliers. We use the IQR rule to detect them; anything outside this range will be dropped:

Using the interquartile range method to remove outliers would discard 54% of our dataset, which is not a desirable outcome. As observed above, many features are filled with 0s.

We can quantify sparsity this way: if X% of the observations of a variable are 0, it is not a very informative feature. If a dataframe has mostly sparse features, we can say it has a sparse structure.

We defined two thresholds for sparsity: 95% and 90%. Decreasing the threshold, we find more sparse features. This can be advantageous because it would help the algorithms easily detect the difference between two classes (assuming that one class has 0 values while the positive class has non-zero values).

As we noted earlier, looking at the descriptive statistics and some density plots, the variables are highly centered around zero. Let's quantify the skewness and kurtosis:

As suggested by the histograms, the data is far from normal distribution with many features having high skewness and kurtosis values.

This is problematic because non-normal data may violate the assumptions of some machine learning models, or simply make it hard for an algorithm to detect differences among the non-zero values.

One way to make the distribution less skewed is to apply a log transformation:

The same feature before the log transformation is heavily centered around 0:

Now let's take only the first 50 columns (for speed) and calculate skewness and kurtosis after applying the log transformation:

Now most of the variables have a skewness score around 0, as expected after the log transformation.

If we compare the density plots with the pre-transformation ones above, we see they have changed: the distributions become more bimodal.

I am loading the filtered and filtered+normalized datasets to make a comparison as requested:

The variables in the filtered dataset look very much like those in the unfiltered dataset before log normalization.

As suggested, we move on with checking for duplicate rows.

To understand which genes convey the same information, we can check their correlations.

We create the dataset without duplicates

We removed less than 1% of the dataset

Data structure after EDA

We are investigating the correlations between the samples:

We see that the correlation matrix of cells contains high values and is therefore mostly red. There are some white stripes that indicate cells that are not correlated with the other cells.

For each cell we count how many low-correlated cells there are. We define low correlation as a coefficient in the range ±0.2:

Let's define the 'uncorrelated cell group' as the group of cells that have low correlation with at least half of the other cells:

8 cells express very different gene profiles (i.e., their correlations with almost all other cells are between -0.2 and 0.2).

The low-correlated group of cells have 0 in at least half of their data points, with fairly high standard deviations.

We can also look at the cells that are highly correlated with other cells in the same way. We define high correlation as absolute values greater than 0.75 (i.e., above 0.75 or below -0.75):

81% of the cells are highly correlated with at least half of the other cells.

These cells above are correlated with more than half of the cells.

So far we have looked at the correlation between different samples (cells), using correlation to check whether there is a cluster of cells that differs from the rest and a cluster of cells that are very similar to each other.

Now let's instead look at the correlation between the Hypoxia cells:

Let's look at the correlation between Normal cells:

The average correlation within the two cell groups (low oxygen condition and high oxygen condition) is similar.

That means high oxygen cells are not more similar to each other than low oxygen cells are to each other.

We choose 5 random cells from high oxygen condition and then 5 random cells from low oxygen condition and look at their distributions:

For both the no-hypoxia and hypoxia conditions, we chose 4 random cells; as visualized, these cells have high correlations with the other cells.

We also check the correlations between the features (i.e. the expressions of different genes) as requested. It takes too long to check all the features, so we use only 5% of them for this exercise:

Just looking at the first 20 features, we notice red areas on the correlation matrix that indicate high positive correlations. We also notice some negative correlations (but not as high). Features with high correlations can be problematic for some machine learning algorithms. This problem is known as multicollinearity. To mitigate it, only one of each highly correlated pair of features should be used in the model.

-------------- Memory Cleaning Start --------------

-------------- Memory Cleaning End --------------

Unsupervised learning

UL1: MCF7 Cell line

I am loading the train set with 3000 features:

We want single cells to be our observations, and the gene expressions to be the features. So we transpose the dataset:
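The transpose step can be sketched on a tiny made-up frame (names and values are assumptions):

```python
import numpy as np
import pandas as pd

# Files ship genes as rows and cells as columns (toy stand-in)
genes_by_cells = pd.DataFrame(np.arange(6).reshape(3, 2),
                              index=["GENE_A", "GENE_B", "GENE_C"],
                              columns=["cell_1", "cell_2"])

# Transpose so each row is a cell (observation) and each column a gene (feature)
cells_by_genes = genes_by_cells.T
```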

I am loading the test set and transposing it:

UL1/Dimensionality Reduction: MCF7 Cell line -- 1. PCA

The first unsupervised learning technique we will use to find hidden patterns in the data is PCA.

PCA is a dimensionality reduction technique that projects the data onto a different vector space along the directions of maximum variance:

We standardize each feature, as is usually done before PCA:

Now that our dataset has zero mean and unit variance, we can apply the PCA transformation.

We do not specify the number of components; according to the documentation, since the number of samples is less than the number of features, sklearn's PCA will return as many components as there are samples.
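The standardize-then-fit pipeline can be sketched as follows (the 40x300 toy matrix is an assumption standing in for the real cells-by-genes data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 300))  # 40 cells, 300 genes (toy stand-in)

X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance per gene

# With n_components unset, sklearn keeps min(n_samples, n_features) components
pca = PCA().fit(X_std)
scores = pca.transform(X_std)
```

Since n_samples (40) < n_features (300) here, the kept components capture all of the variance.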

We will do analyses later on to understand the relationship between the information in the dataset and the optimal number of components to use:

We plot the cumulative variance of the first ten principal components:

The first ten principal components explain 24% of the variance of the dataset. The first two principal components alone capture 10% of the total variance; including the third component adds only 3% more.

We need 100 components to account for 70% of the variance in the data. Given that we start with 3000 features, reducing to 100 dimensions is a substantial saving.

For this exercise, we can plot Hypoxia cells and no hypoxia cells using the two principal components:

We apply the PCA transformer to the test set:

We do not have the oxygen condition labels for the cells in the test set, so we cannot see the separation by condition as we did with the train set. But we would expect the cells to the left and to the right of 0 to represent the two conditions.

UL1/Dimensionality Reduction: MCF7 Cell line -- 2. Isomap

Isomap is short for Isometric Mapping. It is a non-linear dimensionality reduction method that preserves local structure. It is actually a combination of several algorithms: k-nearest neighbors (KNN), a shortest-path algorithm (for example, Dijkstra's), and Multidimensional Scaling (MDS). Isomap is distinguished from MDS by its preservation of geodesic distances, which in turn preserves the manifold structure in the resulting embedding. A geodesic is formally defined as the shortest path along the surface itself.
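A minimal Isomap sketch on a toy manifold (the swiss-roll-like data and the n_neighbors value are assumptions; the real input would be the standardized cells):

```python
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.default_rng(6)
# Swiss-roll-like toy data: points on a curled 2D surface in 3D
t = rng.uniform(0, 3 * np.pi, 200)
X = np.column_stack([t * np.cos(t), rng.uniform(0, 5, 200), t * np.sin(t)])

# KNN graph -> shortest-path (geodesic) distances -> MDS embedding
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
```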

Isomap separates the two classes. Even using only the first dimension (with a boundary around dim1 = -50) we can differentiate Hypoxia and Normal cells, with a small error for the observations between -50 and 0.

UL1/Dimensionality Reduction: MCF7 Cell line -- 3. T SNE (T-distributed Stochastic Neighbor Embedding)

As a third dimensionality reduction method, we will try the T-SNE algorithm. T-SNE works well on data with non-linear structure. It matches the neighbor probability distributions of the original and embedded spaces, keeping similar observations close together and dissimilar ones far apart.

The TSNE implementation in sklearn does not have a transform method, so we cannot apply it to the test set to see whether well-separated clusters would appear there too.
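A minimal sketch of this limitation (the toy data and perplexity are assumptions): fit_transform embeds only the data it is given, and there is no transform for new points.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 50))  # toy stand-in for the standardized cells

# fit_transform embeds this set only; unseen data cannot be projected later
emb = TSNE(n_components=2, perplexity=30.0, init="pca",
           random_state=0).fit_transform(X)
```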

Comparison of the dimension reduction methods:

UL1/Clustering: MCF7 Cell line -- Kmeans

We can use our PCA- or TSNE-transformed datasets to find clusters with the Kmeans algorithm. We use these transformed datasets to help the algorithm; we also tried clustering the untransformed dataset for comparison.

The metric that is usually minimized is called inertia: the sum of squared distances of samples to their closest cluster centroid.

A. Kmeans on PCA transformed data:

We choose to use 3 PCA components to transform the data (there is little difference between 2 and 3, as explained above):

We calculate the inertia for Kmeans with 1 to 10 clusters:
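The elbow sweep can be sketched like this (two well-separated toy blobs stand in for the PCA-transformed cells; the blob locations are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)
# Two well-separated toy blobs standing in for the PCA-transformed cells
X = np.vstack([rng.normal(-5, 1, size=(50, 3)),
               rng.normal(5, 1, size=(50, 3))])

# Inertia = sum of squared distances of samples to their closest centroid
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
# For two real clusters, inertia drops sharply at k=2 and then flattens
```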

There is a breakpoint when the number of clusters equals 2, so we decide to use 2 groups. We visualize the samples and their cluster centroids.

Performing Kmeans on the PCA-transformed dataset of the MCF7 cell line works well. We found two clusters corresponding to the Hypoxia and Normal conditions. In red and blue we display the Hypoxia and Normal conditions (the ground truth of the dataset), and in black the cluster centroids found, which match the truth.

B. Kmeans on T SNE transformed data:

We apply the same to the TSNE transformed data:

2 clusters reduce inertia considerably:

Kmeans on the TSNE-transformed data worked even better than on PCA. We identify the centers of the two condition groups very well. In red and blue we display the Hypoxia and Normal conditions (the ground truth of the dataset), and in black the cluster centroids found, which match the truth.

C. Kmeans on the original data:

We apply the same procedure to the original data, which is neither standardized nor PCA-transformed:

For the original data, there is a second elbow at x=3:

Three clusters do not work well. What about 2 clusters?

We obtained bad results with the original dataset. It is better to standardize, reduce the dimensionality, and cluster the dataset projected onto the essential dimensions found.

Finally, we apply Kmeans to the test set, PCA-transformed with three components, to explore:

It looks very much like the figure of Kmeans applied to the PCA-transformed training set.

----- Memory Cleaning Start -----

----- Memory Cleaning End -----

UL2: HCC1806 Cell line

I am loading the train set with 3000 features:

We want single cells to be our observations, and the gene expressions to be the features. So we transpose the dataset:

I am loading the test set and transposing it:

UL2/Dimensionality Reduction: HCC1806 Cell line -- 1. PCA

The first unsupervised learning technique we will use to find hidden patterns in the data is PCA.

PCA is a dimensionality reduction technique that projects the data onto a different vector space along the directions of maximum variance:

We standardize each feature, as is usually done before PCA:

Now that our dataset has zero mean and unit variance, we can apply the PCA transformation.

We do not specify the number of components; according to the documentation, since the number of samples is less than the number of features, sklearn's PCA will return as many components as there are samples.

We will do analyses later on to understand the relationship between the information in the dataset and the optimal number of components to use:

We plot the cumulative variance of the first ten principal components:

The first ten principal components explain 24% of the variance of the dataset. The first two principal components alone capture less than 7% of the total variance; including the third component adds only 3% more.

We need 100 components to account for 80% of the variance in the data. Given that we start with 3000 features, reducing to 100 dimensions is a substantial saving.

For this exercise, we can plot Hypoxia cells and No Hypoxia cells using the two principal components:

PCA with two components does not separate the data well, worse than its performance on the other cell line's data.

We apply the PCA transformer to the test set:

There are no well defined clusters observed in the test data.

Let's try again, reducing the train data to three components:

Even with three components, we still cannot identify two distinct clusters. The third dimension did not add much information.

UL2/Dimensionality Reduction: HCC1806 Cell line -- 2. T SNE (T-distributed Stochastic Neighbor Embedding)

As a second dimensionality reduction method, we will try the T-SNE algorithm. T-SNE works well on data with non-linear structure. It matches the neighbor probability distributions of the original and embedded spaces, keeping similar observations close together and dissimilar ones far apart.

We obtain only 2 components.

The conditions are not separable from each other with the TSNE solution. We try with three components:

The 3D solution does not work well either.

Comparison between dimensionality reduction methods

If we compare the two methods, TSNE works less well for HCC1806 data.

UL2/Clustering: HCC1806 Cell line -- Kmeans

We can use our PCA- or TSNE-transformed datasets to find clusters with the Kmeans algorithm. We use these transformed datasets to help the algorithm; we also tried clustering the untransformed dataset for comparison.

The metric that is usually minimized is called inertia: the sum of squared distances of samples to their closest cluster centroid.

A. Kmeans on PCA transformed data:

We choose to use 3 PCA components to transform the data (there is little difference between 2 and 3, as explained above):

We calculate the inertia for Kmeans with 1 to 10 clusters:

There is a breakpoint when the number of clusters equals 4.

We train a Kmeans model to obtain 4 clusters. Then we visualize the cluster centers in 2D against different components of the data, to explore whether any of them indicates a good separation:

With the standardized and PCA-transformed dataset of the HCC1806 cell line, Kmeans does not give good results.

B. Kmeans on TSNE transformed data:

We apply the same to the TSNE transformed data:

2 clusters reduce inertia considerably, as there is an initial bend. We try the solution with 2 clusters:

This solution is not able to cluster the data into separate oxygen-condition groups. In red and blue we display the Hypoxia and Normal conditions (the ground truth of the dataset), and in black the cluster centroids found, which DO NOT match the truth.

C. Kmeans on the original data:

We apply the same procedure to the standardized original dataset:

We try a solution with 2 clusters, visualizing several pairs of features:

The 2-cluster solution does not work well. We try with 3 dimensions:

As these do not work well, we try the original, unstandardized data. We plot the first and third features. The cluster centers are too close to each other:

We obtained bad results with the original dataset. It is better to standardize, reduce the dimensionality, and cluster the dataset projected onto the essential dimensions found.

Conclusion:

Unlike with the MCF7 dataset, for the HCC1806 dataset PCA worked better than TSNE at separating the data into two clusters according to the Hypoxia condition.

--------- Memory Cleaning Start ---------

--------- Memory Cleaning End ---------

Supervised learning: SmartSeq for MCF7 & HCC1806 lines

We will answer the questions asked in the report template:

Feature Selection: PCA transformation

Supervised Learning Preparation:

SL1: Random Forest

For both cell lines, we fit a Random Forest model. We find the best Random Forest model using the cross-validation scores from a grid search.

For each combination of values defined in the parameter dictionary, the best model is the one that performs best on average across 3 folds. We do not use more than 3 folds because we do not have many samples in the dataset.
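The grid search can be sketched like this (the toy classification data and the parameter grid below are assumptions, not the report's exact ones):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy binary problem standing in for the PCA-reduced cells
X, y = make_classification(n_samples=120, n_features=20, random_state=0)

# Hypothetical parameter grid for illustration
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# 3-fold CV: each combination is scored by its mean accuracy across folds
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy").fit(X, y)

best_model = search.best_estimator_   # refit on the full data by default
mean_cv_score = search.best_score_    # mean accuracy over the 3 folds
```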

The Random Forest model has better accuracy for HCC1806 (77%) than for MCF7 (76%). The cross-validation scores have a lower standard deviation for HCC1806 (a more stable performance across folds for this cell line).

The best models obtained have the same hyperparameters for both cell types. We fit this model:

SL2: LR

For both cell lines, we fit a Logistic Regression model. We find the best Logistic Regression model using the cross-validation scores from a grid search. For each combination of values defined in the parameter dictionary, the best model is the one that performs best on average across 3 folds. We do not use more than 3 folds because we do not have many samples in the dataset.

The Logistic Regression model predicts the Hypoxia condition better for the MCF7 cell line than for HCC1806 (100% accuracy vs 95%, and a performance std of 0 vs 0.01). Logistic Regression worked better than the Random Forest models for both cell lines.

Again, the best models obtained have the same hyperparameters for both cell types. We fit this model:

SL3: Perceptron

For both cell lines, we fit a Perceptron model. We find the best Perceptron model using the cross-validation scores from a grid search. For each combination of values defined in the parameter dictionary, the best model is the one that performs best on average across 3 folds. We do not use more than 3 folds because we do not have many samples in the dataset.

The Perceptron model predicts the Hypoxia condition much better for the MCF7 cell line than for HCC1806 (98% accuracy vs 92%, and a performance std of 0.005 vs 0.02).

The Perceptron worked better than the Random Forest models but worse than Logistic Regression for both the MCF7 and HCC1806 cell lines.

Again, the best models obtained have the same hyperparameters for both cell types. We fit this model:

Comparison of SL models for the data collected with SmartSeq

If we consider the cross validation scores, the best model for both cell types is Logistic Regression:

--------- Memory Cleaning ------- START

--------- Memory Cleaning ------- END

We can test each classifier as a predictor on the cell line it was not developed for. Does it predict well? As asked in the report template, we predict the oxygen condition in one dataset using the model built for the other dataset:

The model built for MCF7 works better than a random model on HCC1806 data (accuracy is more than 0.5), but the opposite is not true!

Test Set Predictions

SmartSeq

We save our predictions into separate files as requested. The predictions are in the prediction column:

One single model for both cell types

We build one model by concatenating the datasets of the two cell lines. We apply the same steps of standardization and PCA transformation to the combined set.

The RF model trained on all the cells together has an average performance of 0.73. This is lower than the RF models of the individual cell lines.

The LR model trained on all the cells together has an average performance of 0.98. This is the best among all our models.

The Perceptron model trained on all the cells together has an average performance of 0.89. This is lower than the Perceptron models of the individual cell lines.

One single model for DropSeq Technique

We also build one model for DropSeq by concatenating the datasets of the two cell lines, applying the same steps of standardization and PCA transformation to the combined set.

The ranking of the models stays the same: Logistic Regression is the most efficient, then the Perceptron, and Random Forest is the least efficient. Comparing model performance between SmartSeq and DropSeq, we obtained better results with SmartSeq.

DropSeq

We save our predictions into separate files as requested. The predictions are in the prediction column:

Predictions in our report with a short discussion

True Positives (TP) = 25, False Positives (FP) = 0, True Negatives (TN) = 19, False Negatives (FN) = 1

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (25 + 19) / (25 + 19 + 0 + 1) = 0.978 or 97.8%

Precision = TP / (TP + FP) = 25 / (25 + 0) = 1 or 100%

Recall (Sensitivity) = TP / (TP + FN) = 25 / (25 + 1) = 0.9615 or 96.15%

True Positives (TP) = 32, False Positives (FP) = 0, True Negatives (TN) = 31, False Negatives (FN) = 0

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (32 + 31) / (32 + 31 + 0 + 0) = 1 or 100%

Precision = TP / (TP + FP) = 32 / (32 + 0) = 1 or 100%

Recall (Sensitivity) = TP / (TP + FN) = 32 / (32 + 0) = 1 or 100%

In both cases, accuracy, precision, and recall show excellent performance, indicating a high level of accuracy and reliability of the model on both the HCC1806 and MCF7 cell lines.